We consider the problem of constructing fast and small binary adder circuits.
Among widely-used adders, the Kogge-Stone adder is often considered the
fastest, because it computes the carry bits for two $n$-bit numbers (where
$n$ is a power of two) with a depth of $2\log_2 n$ logic gates, size
$4n\log_2 n$, and all fan-outs bounded by two. Fan-outs of more than two are
avoided, because they lead to the insertion of repeaters for repowering the
signal and additional depth in the physical implementation.

However, the depth bound of the Kogge-Stone adder is off by a factor of
two from the lower bound of $\log_2 n$. This bound is achieved asymptotically
in two separate constructions by Brent and Krapchenko. Brent’s construction
gives neither a bound on the fan-out nor the size, while Krapchenko’s adder
has linear size, but can have up to linear fan-out. With a fan-out bound of two,
neither construction achieves a depth of less than $2\log_2 n$.

In a further approach, Brent and Kung proposed an adder with linear size 
and fan-out two, but twice the depth of the Kogge-Stone adder. 

These results are 33-43 years old and no substantial theoretical improvement
has been made since then. In this paper we integrate the individual advantages
of all previous adder circuits into a new family of full adders, the first
to improve on the depth bound of $2\log_2 n$ while maintaining a fan-out bound
of two. Our adders achieve an asymptotically optimum logic gate depth of
$\log_2 n + o(\log_2 n)$ and linear size $O(n)$.
1 Introduction


Given two binary addends $A = (a_n \dots a_1)$ and $B = (b_n \dots b_1)$, where index $n$ denotes the
most significant bit, their sum $S = A + B$ has $n + 1$ bits. We are looking for a logic circuit,
also called an adder, that computes $S$. Here, a logic circuit is a non-empty connected acyclic
directed graph consisting of nodes that are either gates with incoming and outgoing edges,
inputs with at least one outgoing edge and no incoming edges, or outputs with exactly one
incoming edge and no outgoing edges. Gates represent one- or two-bit Boolean functions,
specifically And, Or, Xor, Not or their negations. A small example is shown on the right
side of Figure 1a. The fan-in is the maximum number of incoming edges at a vertex, and
it is bounded by two for all gates.

The main characteristics in adder design are the depth, the size, and the fan-out of a 
circuit. The depth is defined as the maximum length of a directed path in the logic circuit 
and is used as a measure for its speed. The lower the depth, the faster the adder. The
size is the total number of gates in the circuit, and is used as a measure for the space 
and power consumption of the adder, both of which we aim to minimize. The fan-out is 
the maximum number of outgoing edges at a vertex. High fan-outs increase the delay and 
require additional repeater gates (implementing the identity function) in physical design. 
Thus, when comparing the depth of adder circuits, their fan-out should be considered as 
well; we will focus on the usual fan-out bound of two. Circuits with higher fan-outs can be 
transformed into fan-out two circuits by replacing each interconnect with high fan-out by a 
balanced binary repeater tree, i.e. the underlying graph is a tree and all gates are repeater 
gates. However, this increases the size linearly and the depth logarithmically in the fan-out. 
Hoover, Klawe, and Pippenger [1984] gave a smarter way to bound the fan-out of a given 
circuit, but it would also triple the size and depth in our case of gates with two inputs. 

Using logic circuit depth as a measure for speed is a common practice in logic synthesis
that simplifies many aspects of physical hardware. In CMOS technology, Nand/Nor gates
are faster than And/Or gates and efficient implementations exist for integrated multi-input
AND-OR-Inversion gates and OR-AND-Inversion gates. We assume that a technology
mapping step [Chatterjee et al. 2006; Keutzer 1988] translates the adder circuit after logic
synthesis using logic gates that are best for the given technology. Despite its simplicity,
the depth-based model is at the core of programs such as BonnLogic [Werber et al. 2007]
for refining carry bit circuits, which is an integral part of the current IBM microprocessor
design flow. Recently, we reduced the running time for computing such carry bit circuits
significantly from O(n^) to $O(n \log n)$ [Held and Spirkl 2014]. For all newly
proposed adder circuits in this paper, we will demonstrate by example how to efficiently
transform them into equivalent circuits using only Nand/Nor and Not gates.

Like most existing adders, we use the notion of generate and propagate signals, see
[Sklansky 1960; Brent 1970; Knowles 1999]. For each position $1 \le i \le n$, we compute a
generate signal $y_i$ and a propagate signal $x_i$, which are defined as follows:

$$y_i = a_i \wedge b_i, \qquad x_i = a_i \oplus b_i, \tag{1}$$

where $\wedge$ and $\oplus$ denote the binary And and Xor functions. The carry bit at position $i + 1$
can be computed recursively as $c_{i+1} = y_i \vee (x_i \wedge c_i)$, since there is a carry bit at position $i + 1$
if the $i$-th bit of both inputs is 1 or, assuming this is not the case, if at least one (hence
exactly one) of these bits is 1 and there was a carry bit at position $i$.

The first carry bit $c_1$ can be used to represent the carry-in, but we usually assume $c_1 = 0$.
The last carry bit $c_{n+1}$ is also called the carry-out. From the carry bits, we can compute
the output $S$ via

$$s_i = c_i \oplus x_i \ \text{ for } 1 \le i \le n \quad \text{and} \quad s_{n+1} = c_{n+1}. \tag{2}$$

With this preparation of constant depth, linear size, and fan-out two at the inputs $a_i, b_i$
and fan-out one at the carry bits $c_{i+1}$ ($i = 1, \dots, n$), the binary addition is reduced to the
problem of computing all carry bits $c_{i+1}$ from $x_i, y_i$ ($i = 1, \dots, n$).
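The reduction can be checked with a short Python sketch. The function names (`add_via_carries`, `to_bits`, `from_bits`) and the list encoding of numbers (list index 0 holds position 1) are illustrative assumptions; the loop evaluates the carry recursion sequentially and is not meant to model a circuit of small depth.

```python
def to_bits(value, n):
    """Little-endian bit list of length n (index 0 is position 1 in the text)."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))


def add_via_carries(a_bits, b_bits, carry_in=0):
    """Add two equal-length bit lists using generate/propagate signals."""
    assert len(a_bits) == len(b_bits)
    y = [a & b for a, b in zip(a_bits, b_bits)]  # generate signals y_i
    x = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate signals x_i
    c = [carry_in]                               # c[0] plays the role of c_1
    for xi, yi in zip(x, y):
        c.append(yi | (xi & c[-1]))              # c_{i+1} = y_i OR (x_i AND c_i)
    s = [ci ^ xi for ci, xi in zip(c, x)]        # s_i = c_i XOR x_i
    s.append(c[-1])                              # s_{n+1} = c_{n+1}
    return s
```

Only the middle loop, which produces the carry bits, is the part that the remainder of the construction has to implement with small depth.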

Convention: From now on, we will omit the preparatory steps (1) and (2) and consider
a circuit an adder circuit if it computes all $c_{i+1}$ from $x_i, y_i$ ($i = 1, \dots, n$).

Expanding the recursive formula for $c_{i+1}$ as in equation (3) results in a logic circuit that
is a path of alternating And and Or gates. It corresponds to the long addition method
and has linear depth $2(n - 1)$.

$$c_{i+1} = y_i \vee (x_i \wedge (y_{i-1} \vee (x_{i-1} \wedge \dots \wedge (y_2 \vee (x_2 \wedge y_1)) \dots ))) \tag{3}$$


1.1 Prefix Graph Adders 

For two pairs $z_i = (x_i, y_i)$ and $z_j = (x_j, y_j)$, we define the associative prefix operator $\circ$ as

$$\binom{x_i}{y_i} \circ \binom{x_j}{y_j} = \binom{x_i \wedge x_j}{y_i \vee (x_i \wedge y_j)}. \tag{4}$$

A circuit computing (4) can be implemented as a logic circuit consisting of three gates and
with depth two as shown in Figure 1a. For $i = 1, \dots, n$, the result of the prefix computation
$z_i \circ \dots \circ z_1$ of the expression $z_n \circ \dots \circ z_1$ contains the carry bit $c_{i+1}$:

$$z_i \circ \dots \circ z_1 = \binom{x_i \wedge x_{i-1} \wedge \dots \wedge x_1}{c_{i+1}}. \tag{5}$$

A circuit of $\circ$-gates computing all prefixes $z_i \circ \dots \circ z_1$ ($i = 1, \dots, n$) for an associative
operator $\circ$ is called a prefix graph. A prefix graph yields an adder by expanding each $\circ$-gate
as in Figure 1a and extracting the carry bits as in (5).
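As a sanity check, the prefix operator and the carry extraction can be written down in a few lines of Python. This is a minimal sketch: the names `prefix_op`, `carries_by_prefix`, `carries_by_recursion` and `is_associative` are hypothetical, pairs are encoded as tuples `(x, y)`, and the prefixes are evaluated one after another rather than by a prefix graph.

```python
from itertools import product


def prefix_op(zi, zj):
    """(x_i, y_i) o (x_j, y_j); zi is the more significant pair."""
    (xi, yi), (xj, yj) = zi, zj
    return (xi & xj, yi | (xi & yj))


def carries_by_prefix(x, y):
    """Return [c_2, ..., c_{n+1}] for c_1 = 0 (list index 0 is position 1)."""
    out, acc = [], None
    for zi in zip(x, y):
        # acc holds z_{i-1} o ... o z_1; extend it on the left by z_i
        acc = zi if acc is None else prefix_op(zi, acc)
        out.append(acc[1])  # second component is the carry bit c_{i+1}
    return out


def carries_by_recursion(x, y):
    """Reference: c_{i+1} = y_i OR (x_i AND c_i) with c_1 = 0."""
    c, out = 0, []
    for xi, yi in zip(x, y):
        c = yi | (xi & c)
        out.append(c)
    return out


def is_associative():
    """Exhaustively check (a o b) o c == a o (b o c) on all bit pairs."""
    pairs = list(product((0, 1), repeat=2))
    return all(prefix_op(prefix_op(a, b), c) == prefix_op(a, prefix_op(b, c))
               for a in pairs for b in pairs for c in pairs)
```

Associativity is what allows a prefix graph to bracket the expression $z_i \circ \dots \circ z_1$ in any convenient way.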

Figure 1: Prefix gate and graph. (a) Prefix gate and underlying logic circuit.

Most previous constructions for adders are based on prefix graphs of small depth, size
and/or fan-out. Sklansky [1960] developed a prefix graph of minimum depth $\log_2 n$, size
$\frac{1}{2} n \log_2 n$, but high fan-out $\frac{1}{2} n + 1$. The first prefix graph with logarithmic depth ($2\log n - 1$)
and linear size ($3n - \log n - 2$) was developed by Ofman [1962], exhibiting a non-constant
fan-out of $\frac{1}{2}\log n$. Kogge and Stone [1973] introduced the recursive doubling algorithm
which leads to a prefix graph with depth $\log_2 n$ and fan-out two (see Figure 1b). Since we
will use variants of it in our construction, we describe it in detail. For $1 \le s \le t \le n$, let
$Z_{s,t} := z_t \circ \dots \circ z_s$, and for $x \in \mathbb{R}$, let $(x)^+ := \max\{x, 0\}$. The graph has $\log_2 n$ levels, and on
level $i$ it computes for every input $j$ ($1 \le j \le n$) the prefix $Z_{1+(j-2^i)^+,j} = z_j \circ \dots \circ z_{1+(j-2^i)^+}$
according to the recursive formula

$$Z_{1+(j-2^i)^+,\,j} = Z_{1+(j-2^{i-1})^+,\,j} \circ Z_{1+(j-2^i)^+,\,(j-2^{i-1})^+} \tag{6}$$

from the prefixes of sequences of $2^{i-1}$ consecutive inputs computed in the previous level.
The fan-out is bounded by two, since every intermediate result is used exactly twice: once as
the “upper half” and once as the “lower half” of an expression of the form $z_j \circ \dots \circ z_{1+(j-2^i)^+}$.
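The recursive doubling scheme can be simulated level by level. The following Python sketch is illustrative only: the name `kogge_stone`, the list layout (index 0 holds $z_1$) and the returned gate count (which counts $\circ$-gates and ignores repeaters) are assumptions made for this example.

```python
def kogge_stone(z, op):
    """Recursive doubling for an associative operator op(upper, lower).

    After level i, cur[j-1] holds z_j o ... o z_{1+(j-2^i)^+}.
    Returns (prefixes, number of levels, number of o-gates).
    """
    n = len(z)
    levels = n.bit_length() - 1
    assert 2 ** levels == n, "n is assumed to be a power of two"
    cur, gates = list(z), 0
    for i in range(1, levels + 1):
        half = 2 ** (i - 1)
        nxt = []
        for j in range(1, n + 1):
            if j <= half:
                nxt.append(cur[j - 1])  # repeater: the right operand is empty
            else:
                # combine the upper half with the lower half from the previous level
                nxt.append(op(cur[j - 1], cur[j - 1 - half]))
                gates += 1
        cur = nxt
    return cur, levels, gates
```

Using string concatenation as the operator makes the index ranges visible: for inputs `"1"` to `"8"`, the prefix at position 3 is `"321"`. Each entry of a level is read at most twice in the next level, once as an upper and once as a lower operand.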

Note that for level $i$ ($1 \le i \le \log_2 n$), a repeater gate (which computes the identity function)
is used instead of a $\circ$-gate if $j \le 2^{i-1}$, i.e. in the case that the right input in (6) is empty.
Repeaters are shown as blue squares in Figure 1b. The Kogge-Stone prefix graph minimizes
both depth and fan-out. On the other hand, since there is a linear number of gates at each 
level, the total size in terms of prefix gates is nlog 2 n — 

Ladner and Fischer [1980] constructed a prefix graph of depth log 2 n but high fan-out. 
Brent and Kung found a linear-size prefix graph with fan-out two, but twice the depth 
of the other constructions. Finally, Han and Carlson [1987] described a hybrid between a 
Kogge-Stone adder and a Brent-Kung adder which achieves a trade-off between depth and 
size. Lower bounds for the trade-off between the depth and size of a prefix graph can be
found in [Fich 1983; Sergeev 2013].

The above prefix graphs can be used for prefix computations with respect to any associative
operator $\circ$. In fact, we will later use a prefix graph in which the operator $\circ$ represents
an And gate. When turning one of the above prefix graph adders into a logic circuit for
addition such that each prefix gate is implemented as in Figure 1a, the depth of the logic
circuit is twice the depth of the prefix graph and the number of logic gates is three times
the number of prefix gates. The fan-out of the underlying logic circuit can increase by one
compared to the prefix graph, because the left propagate signal $x_i$ is used twice within a
prefix gate. In Section 3.1 we will see that in the case of the Brent-Kung adder a fan-out
of two can be achieved by using reduced prefix gates.

Any adder constructed from a prefix graph has a logic gate depth of at least
$\log_\varphi n - 1 > 1.44 \log_2 n - 1$, where $\varphi = \frac{1+\sqrt{5}}{2}$ is the golden section
[Held and Spirkl 2014], see also [Rautenbach et al. 2008]. In [Held and Spirkl 2014] an adder
of size $O(n \log_2 \log_2 n)$ asymptotically attaining this depth bound is described, however
with a high fan-out of $\sqrt{n} + 1$.
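The constant in this bound follows from $\log_\varphi n = \log_2 n / \log_2 \varphi$. A small numeric sketch, with hypothetical helper names, compares the bound with the logic gate depth $2\log_2 n$ of the Kogge-Stone adder:

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden section


def prefix_adder_depth_lower_bound(n):
    """Evaluate log_phi(n) - 1, the lower bound for prefix graph adders."""
    return math.log2(n) / math.log2(PHI) - 1


def kogge_stone_logic_depth(n):
    """Logic gate depth 2 * log2(n) of the Kogge-Stone adder."""
    return 2 * math.log2(n)
```

For $n = 64$ this gives a lower bound of roughly 7.6 against a Kogge-Stone depth of 12, so prefix graph adders cannot reach the depth $\log_2 n + o(\log_2 n)$ that is targeted in this paper.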


1.2 Non-Prefix Graph Adders 

Since none of the $2n$ inputs $x_i, y_i$ ($1 \le i \le n$) except for $x_1$ is redundant for $c_{n+1}$,
the depth of any adder circuit using 2-input gates is at least $\log_2 n + 1$, which would be
attained by a balanced binary tree with inputs/leaves $x_i, y_i$ ($1 \le i \le n$). With adders that
are not based on prefix graphs, this bound is asymptotically tight. Krapchenko showed
that any formula (a circuit with tree topology) for computing $c_{n+1}$ has depth at least
$\log_2 n + 0.15 \log_2 \log_2 \log_2 n + O(1)$ [Krapchenko 2007].

Brent [1970] gives an approximation scheme for a single carry bit circuit attaining an
asymptotic depth of $(1 + \varepsilon) \log_2 n + o(\log_2 n)$ for any given $\varepsilon > 0$. The best known depth
for a single carry bit circuit is $\log_2 n + \log_2 \log_2 n + O(1)$, due to Grinchuk [2008]. However,
[Grinchuk 2008] and [Brent 1970] did not address how to overlay circuits for the different
carry bits to bound the size and fan-out of an adder based on their circuits. One problem
in sharing intermediate results is that this can create high fan-outs.

Krapchenko [1967] (see [Wegener 1987, pp. 42-46]) presented an adder with asymptotically
optimum depth $\log_2 n + o(\log_2 n)$ and linear size. It was refined for small $n$ by
[Gashkov et al. 2007]. However, the fan-out is almost linear.

1.3 Our Contribution 

In this paper, we present the first family of adders of asymptotically optimum depth, linear 
size, and fan-out bound two: 

Theorem 1.1 (Main Theorem). Given two $n$-bit numbers $A, B$, there is a logic circuit
computing the sum $A + B$, using gates with fan-in and fan-out two and that has depth
$\log_2 n + o(\log n)$ and size $O(n)$.

The rest of the paper is organized as follows. In Section 2, we develop a family of adders of
asymptotically minimum depth, fan-out two, but super-linear size
$O\left(n \lceil\sqrt{\log_2 n}\rceil^2\, 2^{\lceil\sqrt{\log_2 n}\rceil}\right)$.
In Section 3, using reductions similar to [Krapchenko 1967], this adder is transformed into
an adder of linear size with the asymptotically same depth, proving Theorem 1.1. While
all of the above adders use only And/Or gates and repeaters, we show in Section 4 that
Theorem 1.1 holds also if only Nand/Nor and Not gates are available.
2 Asymptotically Optimum Depth and Fan-Out Two


For $1 \le s \le t \le n$, let $X_{s,t}$ and $Y_{s,t}$ denote the propagate and generate signal for the
sequence of indices between $s$ and $t$, i.e.

$$X_{s,t} = \bigwedge_{i=s}^{t} x_i,$$

$$Y_{s,t} = y_t \vee (x_t \wedge (y_{t-1} \vee (x_{t-1} \wedge \dots \wedge (y_{s+1} \vee (x_{s+1} \wedge y_s)) \dots ))).$$

The adders based on prefix graphs as in Section 1.1 impose a common topological structure
on the computation of intermediate results $X_{s,t}$ and $Y_{s,t}$. In the adder described by
Brent [1970], on the other hand, intermediate results $X_{s,t}$ and $Y_{s,t}$ are computed separately
within larger blocks.

Let $n = 2^{rk}$ for $r \in \mathbb{N}$ and $k \in \mathbb{N}$ to be chosen later. A central idea for obtaining a faster
adder is to use multi-fan-in (also called high-radix) subcircuits within a Kogge-Stone prefix
graph. While all the prefix gates in Figure 1b have fan-in two, we want to use prefix gates
with fan-in $2^r$, so that the number of levels reduces from $\log_2 n$ to $\log_{2^r} n = \frac{1}{r} \log_2 n$. Each
prefix gate with fan-in $2^r$ represents a logic circuit with fan-in and fan-out bounded by two.
Since the output of each prefix gate will be used in $2^r$ prefix gates at the next level, our
approach also requires duplicating the intermediate result at the output of a prefix gate
$2^{r-1}$ times. To accomplish this, we consider the computation of generate and propagate
sequences separately.

Our adder consists of two global Kogge-Stone type prefix graphs. The first graph
uses 2-input AND-gates and computes the propagate signals used in the second prefix graph.
The second graph uses $2^r$-input subcircuits that are arranged in the same way as the
Kogge-Stone graph, and it computes the generate (carry) signals. Both graphs are modified
to duplicate intermediate generate signals $2^{r-1}$ times and intermediate propagate signals
$2^r$ times so that the overall construction obeys the fan-out bound of two.

2.1 Multi-Input Generate Gates 

We now introduce multi-input generate gates, which are the main building block for computing
the generate signals. Given $2^r$ propagate and generate pairs $(x_{2^r}, y_{2^r}), \dots, (x_1, y_1)$,
a multi-input generate gate computes the generate signal

$$Y_{1,2^r} = y_{2^r} \vee (x_{2^r} \wedge (y_{2^r-1} \vee (x_{2^r-1} \wedge \dots \wedge (y_2 \vee (x_2 \wedge y_1)) \dots ))).$$

The input pairs $(x_i, y_i)$ ($i \in \{1, \dots, 2^r\}$) are not necessarily the input pairs of the adder;
they can be intermediate results.

Each multi-input generate gate has $2^{r-1}$ outputs, each of which provides the result $Y_{1,2^r}$,
because later we want to reuse this signal $2^r$ times and bound the fan-out of each output
by two. In contrast to two-input prefix gates computing (4), multi-input generate gates
do not compute the propagate signals $X_{1,2^r} = \bigwedge_{i=1}^{2^r} x_i$ for the given input pairs. All
required propagate signals will be computed by the separate AND-prefix graph, described
in Section 2.2.
Figure 2: A $2^r$-input $2^{r-1}$-output generate gate for $r = 3$


Figure 2 shows an example of a multi-input generate gate with 8 inputs. A $2^r$-input
generate gate computes $Y_{1,2^r}$ as in the disjunctive normal form

$$Y_{1,2^r} = \bigvee_{j=1}^{2^r} \left( y_j \wedge \left( \bigwedge_{i=j+1}^{2^r} x_i \right) \right),$$

first computing all the minterms $m_j := y_j \wedge \bigwedge_{i=j+1}^{2^r} x_i$ ($j = 1, \dots, 2^r$), and then the
disjunction $\bigvee_{j=1}^{2^r} m_j$. The terms $\bigwedge_{i=j+1}^{2^r} x_i$ are computed as a Kogge-Stone AND-suffix
graph, which arises from a Kogge-Stone prefix graph by reversing the ordering of the
inputs. A single stage of (red) AND-gates and one repeater concludes the computation
of the minterms. Each input $y_i$ is used exactly once within this circuit. The repeater is
dispensable but simplifies the size formula and will become useful in Section 3.
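That the disjunctive normal form agrees with the nested definition of the generate signal can be verified exhaustively for small $r$. The following Python sketch uses hypothetical function names and evaluates both formulas directly; it does not model the gate structure of Figure 2.

```python
from itertools import product


def nested_generate(x, y):
    """Y_{1,m} = y_m OR (x_m AND (... (y_2 OR (x_2 AND y_1)) ...)); index 0 is position 1."""
    acc = 0
    for xi, yi in zip(x, y):
        acc = yi | (xi & acc)
    return acc


def dnf_generate(x, y):
    """OR over all minterms m_j = y_j AND x_{j+1} AND ... AND x_m."""
    m = len(x)
    result = 0
    for j in range(m):
        suffix = 1
        for i in range(j + 1, m):
            suffix &= x[i]          # AND-suffix of the propagate signals above j
        result |= y[j] & suffix     # add minterm m_j to the disjunction
    return result


def forms_agree(r):
    """Compare both forms on all 2^(2m) inputs for m = 2^r."""
    m = 2 ** r
    return all(nested_generate(bits[:m], bits[m:]) == dnf_generate(bits[:m], bits[m:])
               for bits in product((0, 1), repeat=2 * m))
```

Note that the lowest propagate signal $x_1$ does not occur in any minterm, in line with the observation that it is redundant for the carry bit.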

Finally, instead of computing the disjunction $\bigvee_{j=1}^{2^r} m_j$ by a balanced binary Or tree and
duplicating the results $2^{r-1}$ times through a balanced repeater tree, the duplication is
accomplished by $r$ rows of $2^{r-1}$ OR-gates as shown in Figure 2. Formally, let $M_{i,j} = \bigvee_{i'=i}^{j} m_{i'}$
be the disjunction of minterms $i, \dots, j$. Then, on level $l \in \{1, \dots, r\}$, we compute each
signal of the form $M_{i2^l+1,(i+1)2^l}$, $i = 0, \dots, 2^{r-l} - 1$, from the previous level, and we compute
$2^{l-1}$ copies of it. By using

$$M_{i2^l+1,\,(i+1)2^l} = M_{2i2^{l-1}+1,\,(2i+1)2^{l-1}} \vee M_{(2i+1)2^{l-1}+1,\,(2i+2)2^{l-1}},$$

and since each preceding signal is available $2^{l-2}$ times ($l \ge 2$), we can ensure that each of them
has fan-out two. On the last level, we will have computed $2^{r-1}$ copies of $M_{1,2^r} = Y_{1,2^r}$. Each
level uses $2^{r-1}$ OR-gates. Note that a similar construction for reducing fan-out has been
used by Lupanov when extending his well-known bounded-size representation of general
Boolean functions to circuits with bounded fan-out [Lupanov 1962].
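The duplication scheme can be simulated to confirm the gate count per row and the fan-out bound. In this sketch the function name `or_rows` and the round-robin assignment of copies to OR-gates are assumptions chosen for illustration; any assignment that uses each available copy at most twice would serve equally well.

```python
def or_rows(minterms):
    """Simulate the r rows of OR-gates on the 2^r minterm values.

    Returns (outputs, gates_per_level, max_fanout), where outputs are the
    copies of the full disjunction produced on the last level.
    """
    m = len(minterms)
    r = m.bit_length() - 1
    assert 2 ** r == m and r >= 1, "expects 2^r minterms with r >= 1"
    copies = [[v] for v in minterms]      # level 0: one copy of each minterm
    gates_per_level, max_fanout = [], 0
    for l in range(1, r + 1):
        want = 2 ** (l - 1)               # copies produced per signal on level l
        nxt, count = [], 0
        for i in range(m >> l):
            low, high = copies[2 * i], copies[2 * i + 1]
            uses_low, uses_high = [0] * len(low), [0] * len(high)
            block = []
            for c in range(want):
                a, b = c % len(low), c % len(high)   # round-robin choice of copies
                uses_low[a] += 1
                uses_high[b] += 1
                block.append(low[a] | high[b])       # one OR-gate
                count += 1
            max_fanout = max(max_fanout, *uses_low, *uses_high)
            nxt.append(block)
        copies = nxt
        gates_per_level.append(count)
    return copies[0], gates_per_level, max_fanout
```

For eight minterms the simulation uses three rows of four OR-gates each, delivers four copies of the result, and never reads a signal more than twice.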
Lemma 2.1. The multi-input generate gate has $2^r$ generate/propagate pairs as input and
$2^{r-1}$ outputs. Each propagate input has fan-out two and each generate input has fan-out
one. The gate consists of $r2^r + (r + 1)2^{r-1}$ logic gates which have fan-out at most two. The
depth for the propagate inputs $x_i$ is $2r + 1$ and the depth for the generate inputs $y_i$ is $r + 1$
($i \in \{1, \dots, 2^r\}$).

Proof. All the terms $\bigwedge_{i=j+1}^{2^r} x_i$ are computed by a Kogge-Stone AND-suffix graph (blue and
yellow gates in Figure 2) of size

$$2^r \log_2 2^r - 2^{r-1} = (r - 1)2^r + 2^{r-1}.$$

Then, there is a level of $2^r$ gates, consisting of (red) And gates and one repeater, concluding
the computation of the minterms. Finally, there are $r2^{r-1}$ (green) OR-gates to compute the
disjunction $\bigvee_{j=1}^{2^r} m_j$ $2^{r-1}$ times, for a total of

$$r2^r + (r + 1)2^{r-1}$$

gates. By construction, no gate and propagate input has fan-out larger than two, and all
generate inputs have fan-out one. The depth is $r$ for the AND-suffix graph, one for the red
gates, and $r$ for the disjunctions, yielding the desired depths of $2r + 1$ for the propagate
inputs and $r + 1$ for the generate inputs. □
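The size count of the proof can be re-added numerically. The sketch below assumes, as in the accounting above, that the minterm level contributes $2^r$ gates in total (AND-gates plus the one repeater); the function names are hypothetical.

```python
def generate_gate_size_by_parts(r):
    """Sum the three parts of a multi-input generate gate with 2^r inputs."""
    suffix_graph = r * 2 ** r - 2 ** (r - 1)  # Kogge-Stone AND-suffix graph
    minterm_level = 2 ** r                    # red AND-gates plus one repeater
    or_gates = r * 2 ** (r - 1)               # r rows of 2^(r-1) OR-gates
    return suffix_graph + minterm_level + or_gates


def generate_gate_size_formula(r):
    """Closed form r*2^r + (r+1)*2^(r-1) from Lemma 2.1."""
    return r * 2 ** r + (r + 1) * 2 ** (r - 1)


def generate_gate_depths(r):
    """Return (depth for propagate inputs, depth for generate inputs)."""
    return r + 1 + r, 1 + r
```

For $r = 3$, the setting of Figure 2, both counts give 40 gates.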

2.2 Augmented Kogge-Stone And-Prefix Graph 

The second important component of our construction is the augmented Kogge-Stone
AND-prefix graph. It is used to compute $X_{s,t} = \bigwedge_{i=s}^{t} x_i$ for all $1 \le t \le n$ and $s = 1 + (t - 2^{rl})^+$
with $0 \le l < k$, providing each output $2^r$ times through $2^r$ individual gates. It is constructed
as follows. First, we take a Kogge-Stone [1973] prefix graph, where the prefix operator is an
AND-gate, i.e. $\circ = \wedge$. It consists of $\log_2 n$ levels, and on level $i$ it computes for every input
$j$ ($1 \le j \le n$) the prefix $X_{1+(j-2^i)^+,j}$ from the prefixes of sequences of $2^{i-1}$ consecutive
inputs computed in the previous level.

Each of the results $X_{s,t}$ from level $rl$ will later be used in $2^r$ multi-input generate gates
for all $0 \le l < k$, $s = 1 + (t - 2^{rl})^+$ and $1 \le t \le n$. In order to achieve a fan-out bound
of two, starting at the inputs, we insert one row of $n$ repeaters after every $r$ levels of
AND-gates. This allows us to use the repeaters as the inputs for the next level, and to extract
the signals $X_{s,t}$ once at the AND-gates before the repeaters. The construction is shown in
Figure 3 with the extracted outputs $X_{s,t}$ shown as red arrows. The last block of $r$ rows of
gates (hatched gates in Figure 3) of the Kogge-Stone prefix graph can be omitted in our
construction to reduce the size.

Each output signal $X_{s,t}$ will be input to a multi-input generate gate, where it is immediately
duplicated. Thus, each output $X_{s,t}$ of the augmented Kogge-Stone AND-prefix graph
has to be provided through an individual gate. To this end, at each of the $nk$ outputs, we
add $2^{r+1} - 1$ repeater gates as the vertices of a balanced binary tree to create $2^r$ copies
of the signal with a single repeater serving each leaf. For simplicity these repeaters are
hidden in Figure 3.

Figure 3: Augmented Kogge-Stone AND-prefix graph for $r = k = 2$ (inputs $x_{16}, \dots, x_1$).

Lemma 2.2. The total size of the augmented Kogge-Stone AND-prefix graph is
$nr(k - 1) + nk2^{r+1}$.

Proof. Each binary repeater tree at one of the $nk$ outputs consists of $2^{r+1} - 1$ repeaters,
summing up to $nk(2^{r+1} - 1)$ repeaters in these repeater trees. The remaining circuit consists
of $r(k - 1) + k$ rows ($r(k - 1)$ rows of AND-gates and $k$ rows of repeaters) of $n$ gates each,
summing up to $n(r(k - 1) + k)$ gates. Altogether, the circuit contains $nr(k - 1) + nk2^{r+1}$
gates. □

Lemma 2.3. The signal $X_{s,t}$ for $1 \le t \le n$ and $s = 1 + (t - 2^{rl})^+$ for $0 \le l < k$ is available
$2^r$ times with internal fan-out one at a depth of $(l + 1)(r + 1)$.

Proof. The functional correctness is clear by construction. For the depth bound, let
$1 \le t \le n$ and $0 \le l < k$. Then, for $s = 1 + (t - 2^{rl})^+$, the signal $X_{s,t}$ is available at the bottom
of the $l$-th block at a depth of $l(r + 1)$. Subsequently, we create $2^r$ copies of the signal in
a repeater tree of depth $r + 1$. Together, this gives the desired depth $(l + 1)(r + 1)$. □
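The two counts can be recomputed mechanically. The following sketch (with hypothetical function names) sums the parts listed in the proof of Lemma 2.2 and evaluates the depth of Lemma 2.3.

```python
def and_prefix_size_by_parts(n, r, k):
    """Repeater trees at the n*k outputs plus the rows of the prefix graph."""
    repeater_trees = n * k * (2 ** (r + 1) - 1)
    rows = n * (r * (k - 1) + k)   # r(k-1) rows of AND-gates, k rows of repeaters
    return repeater_trees + rows


def and_prefix_size_formula(n, r, k):
    """Closed form n*r*(k-1) + n*k*2^(r+1) from Lemma 2.2."""
    return n * r * (k - 1) + n * k * 2 ** (r + 1)


def propagate_output_depth(l, r):
    """Depth at which the copies of a block-l output are available (Lemma 2.3)."""
    block_bottom = l * (r + 1)     # l blocks of r AND rows and one repeater row
    repeater_tree = r + 1          # balanced tree producing 2^r copies
    return block_bottom + repeater_tree
```

For the parameters of Figure 3, $r = k = 2$ and $n = 16$, the graph has 288 gates including the hidden repeater trees.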

2.3 Multi-Input Generate Adder 

We now describe the multi-input generate adder for $n = 2^{rk}$. It consists of an augmented
Kogge-Stone AND-prefix graph from the previous section and a circuit consisting of
multi-input generate gates similar to a radix-$2^r$ Kogge-Stone adder.

The construction uses $k$ rows with $n$ multi-input generate gates or repeater trees (see
Figure 4). The $t$-th multi-input generate gate in level $l \in \{1, \dots, k\}$ computes $Y_{1+(t-2^{rl})^+,t}$
according to the formula

$$Y_{1+(t-2^{rl})^+,\,t} = \bigvee_{j=1}^{2^r} \left( Y_{1+(t-(2^r-j+1)2^{r(l-1)})^+,\,(t-(2^r-j)2^{r(l-1)})^+} \wedge \left( \bigwedge_{i=j+1}^{2^r} X_{1+(t-(2^r-i+1)2^{r(l-1)})^+,\,(t-(2^r-i)2^{r(l-1)})^+} \right) \right). \tag{8}$$

If $(t - 2^{rl})^+ < (t - 2^{r(l-1)})^+$ (yellow circuits in Figure 4), this computation is carried
out using a multi-input generate gate from Section 2.1. As its inputs, it uses generate
signals from the previous level, $l - 1$, and propagate signals obtained from the augmented
Kogge-Stone AND-prefix graph.

Figure 4: Multi-input multi-output generate gate adder for $r = k = 2$

Except for the last level, each intermediate generate signal will be used $2^r$ times as in
(8) in the next level. As the fan-out of each generate input inside a multi-input generate
gate is one, we need to provide $2^{r-1}$ copies through individual gates to serve $2^r$ multi-input
generate gates with fan-out two.

If $(t - 2^{rl})^+ = (t - 2^{r(l-1)})^+$ (blue squares in Figure 4), $Y_{1+(t-2^{rl})^+,t}$ is already computed
in the previous level, and in this level it is sufficient to duplicate the signal $2^{r-1}$ times using
a balanced binary repeater tree.

The augmented Kogge-Stone AND-prefix graph provides each signal $2^r$ times with individual
repeaters. Thus, it can be distributed to $2^r$ multi-input generate gates, where the
fan-out of each propagate input is two.

For the first level of multi-input generate gates, we duplicate each generate signal $y_i$
at an input $i \in \{1, \dots, n\}$ using a balanced binary repeater tree of depth $r - 1$ and size
$2 + 4 + \dots + 2^{r-1} = 2^r - 2$. Again, we can distribute each copy to two multi-input generate
gates, maintaining fan-out two.

In the last level of multi-input generate gates, we do not need to duplicate the signals
anymore. Instead of the $r$ rows of $2^{r-1}$ OR-gates each, we can compute the single outputs
using a balanced binary tree of $2^r - 1$ OR-gates and depth $r$.

Lemma 2.4. The multi-input generate adder for $n = 2^{rk}$ bits obeys a fan-out bound of
two, contains less than

$$3nk(r + 2)2^{r-1} + n2^r + nrk$$

gates, and has depth

$$kr + 2r + k + 1.$$


Proof. Inside each multi-input generate gate, the fan-out of propagate inputs is two and
the fan-out of generate inputs is one. Thus, it suffices to observe that in each non-output
level there are $2^r$ copies of each propagate signal and $2^{r-1}$ copies of each generate signal,
and that the fan-out of two holds within the augmented Kogge-Stone graph and within
each multi-input generate gate.

By Lemma 2.2, the size of the augmented Kogge-Stone AND-prefix graph is
$nr(k - 1) + nk2^{r+1}$. The size of the $n$ balanced binary trees duplicating the input generate
signals is $n(2^r - 2)$.

The remainder of the graph consists of $k$ rows of $n$ $2^r$-input multi-input generate gates
or repeater trees. The size of a repeater tree (blue boxes in Figure 4) is at most
$2^{r-1} - 1 < r2^r + (r + 1)2^{r-1}$ ($r \ge 1$), which is the size of a multi-input generate gate. Thus, the size
of all these multi-input generate gates is at most $nk(r2^r + (r + 1)2^{r-1})$. Summing up, the
total size is at most

$$\begin{aligned}
& nr(k - 1) + nk2^{r+1} + n(2^r - 2) + nk\left(r2^r + (r + 1)2^{r-1}\right) \\
&= nk2^{r+1} + nkr2^r + n2^r + nk(r + 1)2^{r-1} + nkr - n(r + 2) \\
&= nk\left(4 + 2r + (r + 1)\right)2^{r-1} + n2^r + nkr - n(r + 2) \\
&< 3nk(r + 2)2^{r-1} + n2^r + nkr.
\end{aligned}$$

For a simpler depth analysis, we assume that the input generate signals $y_i$ arrive delayed
at a depth of $r + 2$. The generate input signals traverse a binary tree of depth $r - 1$ and
the propagate input signals traverse a binary tree of depth $r + 1$ before reaching the first
multi-input generate gate, i.e. generate signals $y_i$ become available at depth $2r + 1$ and
propagate signals at depth $r + 1$. Thus, the first row of multi-input generate gates has
depth

$$3r + 2 = \max\{2r + 1 + 1 + r,\; r + 1 + r + 1 + r\},$$

where the first term in the maximum is caused by the delayed generate signals $y_i$ and the
second term by the propagate signals $x_i$ ($1 \le i \le n$).

For the next level, the propagate signals are available at time $2r + 2$, and the generate
signals at time $3r + 2$, and the propagate signals again arrive $r$ time units before the
corresponding generate signals, so at the next level, both signals arrive $r + 1$ time units later
than they did before. Inductively, we know that for each level $2 \le l \le k$, the generate and
propagate signals arrive at a depth of $(l - 1)(r + 1)$ more than they did at the first
level. Consequently, the total depth of the adder is $(k - 1)(r + 1) + 3r + 2 = kr + 2r + k + 1$. □

If $\sqrt{\log n} \in \mathbb{N}$, we can choose $r = k = \sqrt{\log n}$ and obtain the following result.

Corollary 2.5. If $\sqrt{\log n} \in \mathbb{N}$, there is a multi-input generate adder for $n$ bits with fan-out
two, size at most

$$3n\left(\log n + 2\sqrt{\log n}\right)2^{\sqrt{\log n} - 1} + n \log n,$$

and depth

$$\log n + 3\sqrt{\log n} + 1.$$


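In general, $\sqrt{\log n} \notin \mathbb{N}$, and we get the following result.

Theorem 2.6. Given $n$ propagate and generate pairs $(x_i, y_i)$ ($i \in \{1, \dots, n\}$), there is a circuit
computing all carry bits with maximum fan-out 2, depth at most

$$\log_2 n + 5\left\lceil\sqrt{\log_2 n}\right\rceil + 2.$$

The size is at most

$$4n \left\lceil\sqrt{\log_2 n}\right\rceil^2 2^{\left\lceil\sqrt{\log_2 n}\right\rceil} \tag{9}$$

if $n \ge 16$, and at most

$$8n \left\lceil\sqrt{\log_2 n}\right\rceil^2 2^{\left\lceil\sqrt{\log_2 n}\right\rceil} \tag{10}$$

if $n \le 15$.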

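Proof. We choose $r = k = \lceil\sqrt{\log_2 n}\rceil$ and apply Lemma 2.4, obtaining

$$3nk(r + 2)2^{r-1} + n2^r + nrk = n\left(3(r^2 + 2r)2^{r-1} + 2^r + r^2\right). \tag{11}$$

Now, if $n \ge 16$, we have $r = k \ge 2$. Thus, we can use $2r \le r^2$ and $2^r + r^2 \le r^2 2^r$ to bound
the right hand side by

$$n\left(3(r^2 + r^2)2^{r-1} + r^2 2^r\right) = 4nr^2 2^r,$$

implying (9).

Otherwise, $n < 16$, $r = k \le 2$, $r^2 \le 2r$, $r^2 \le 2^r$, and the right hand side of (11) is
bounded by

$$n\left(3(2r + 2r)2^{r-1} + 2^r + 2^r\right) \le 8nr2^r \le 8nr^2 2^r,$$

implying (10). The resulting depth is

$$\begin{aligned}
kr + 2r + k + 1 &= \lceil\sqrt{\log_2 n}\rceil^2 + 3\lceil\sqrt{\log_2 n}\rceil + 1 \\
&\le \left(\lfloor\sqrt{\log_2 n}\rfloor + 1\right)^2 + 3\lceil\sqrt{\log_2 n}\rceil + 1 \\
&\le \log_2 n + 5\lceil\sqrt{\log_2 n}\rceil + 2.
\end{aligned}$$

□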
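The inequalities used in this proof can be spot-checked numerically. The sketch below makes two assumptions for illustration: the helper `ceil_sqrt_log2` computes $\lceil\sqrt{\log_2 n}\rceil$ with integer arithmetic to avoid rounding issues, and the size is taken to be the bound of Lemma 2.4 evaluated at $n$, exactly as in (11).

```python
import math


def ceil_sqrt_log2(n):
    """Smallest integer r >= 1 with r*r >= log2(n), for n >= 2."""
    r = 1
    while 2 ** (r * r) < n:
        r += 1
    return r


def lemma_2_4_size_bound(n, r, k):
    """Bound 3nk(r+2)2^(r-1) + n2^r + nrk from Lemma 2.4."""
    return 3 * n * k * (r + 2) * 2 ** (r - 1) + n * 2 ** r + n * r * k


def theorem_2_6_holds(n):
    """Check the depth bound and the size bounds (9) or (10) for one n."""
    r = k = ceil_sqrt_log2(n)
    depth = k * r + 2 * r + k + 1
    size = lemma_2_4_size_bound(n, r, k)
    factor = 4 if n >= 16 else 8
    return (depth <= math.log2(n) + 5 * r + 2
            and size <= factor * n * r * r * 2 ** r)
```

For example, $n = 16$ gives $r = k = 2$, depth 11 against a bound of 16, and size bound 896 against $4 \cdot 16 \cdot 4 \cdot 4 = 1024$.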
If $\sqrt{\log_2 n} \notin \mathbb{N}$, the adder in Theorem 2.6 is larger than necessary since it has
$n' = 2^{\lceil\sqrt{\log_2 n}\rceil^2} > n$ inputs. If for example $n = 32$, we choose $r = k = 3$ and $n' = 512$. Thus, if
$\lceil\sqrt{\log_2 n}\rceil^2 \ge \log_2 n + \lceil\sqrt{\log_2 n}\rceil$, choosing $r = \lceil\sqrt{\log_2 n}\rceil - 1$ instead still yields an adder with
at least $n$ inputs and outputs and reduces the size and depth significantly. For $n = 32$, we
would still obtain a 64-input adder using this method.
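The parameter choice can be illustrated with a few lines of Python. The function names are hypothetical, and the condition is tested in the equivalent integer form $2^{k(k-1)} \ge n$ with $k = \lceil\sqrt{\log_2 n}\rceil$.

```python
def ceil_sqrt_log2(n):
    """Smallest integer k >= 1 with k*k >= log2(n), for n >= 2."""
    k = 1
    while 2 ** (k * k) < n:
        k += 1
    return k


def choose_parameters(n):
    """Return (r, k, n_prime) with n_prime = 2^(r*k) >= n inputs."""
    k = ceil_sqrt_log2(n)
    r = k
    if 2 ** (k * (k - 1)) >= n:   # equivalent to k^2 >= log2(n) + k
        r = k - 1                 # the smaller radix still covers n inputs
    return r, k, 2 ** (r * k)
```

For $n = 32$ this returns $r = 2$, $k = 3$ and $n' = 64$, whereas $n = 16$ keeps $r = k = 2$ with $n' = 16$.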

The analysis can be refined further by noticing that the columns $n'$ down to $n + 1$ in the
augmented Kogge-Stone AND-prefix graph and the multi-input gate graph can be omitted,
since they are not used for the computations of the first $n$ output bits. This reduces the
size of the construction. If $n' > n$, we can omit the left half of the construction and notice
that the right half of the lowest row of multi-input generate gates only has $2^{r-1}$ inputs, so we
can actually use $2^{r-1}$-input generate gates and reduce the depth by 1. This process can
be iterated until $n' = n$, which decreases the rounding error incurred in Theorem 2.6: the
depth is decreased by $\lceil\sqrt{\log_2 n}\rceil^2 - \log_2 n$.

In this section, we have achieved a depth bound of $\log_2 n + O(\sqrt{\log n}) = \log_2 n + o(\log_2 n)$,
which is asymptotically optimal, since the lower bound is $\log_2 n$.


3 Linearizing the Size of the Adder 

To achieve a linear size while keeping the adder asymptotically as fast as possible, we adopt
a technique similar to the construction by Brent and Kung [1982], which was first used as
a size-reduction tool by Krapchenko [1967] (see [Wegener 1987, pp. 42-46]).


3.1 Brent-Kung Step 


Brent and Kung [1982] construct a prefix graph recursively as shown in Figure 5. If $n$ is
at least two, it computes the $n/2$ intermediate results $z_n \circ z_{n-1}; \dots; z_2 \circ z_1$ (see Section 1.1
for the definition of $z_i$). A prefix graph for these $n/2$ inputs is used to compute the prefixes
$Z_{1,2i}$ for all even indices $2i$ ($i \in \{1, \dots, n/2\}$). For odd indices, the prefix needs to be corrected
by one more prefix gate as $Z_{1,2i+1} = z_{2i+1} \circ Z_{1,2i}$ ($i \in \{1, \dots, n/2 - 1\}$). We call this method
of input halving and output correction a Brent-Kung step. Note that the propagate signals
are not needed after the correction step. Thus, we can use reduced prefix gates (Figure 6)
in the output correction step. In these prefix gates, the left propagate signal $x_i$ is used only
once. Thus, the underlying logic circuit inherits the fan-out of two from the prefix graph.

The Brent-Kung step reduces the instance size by a factor of two, but it increases the
depth of the construction by four and the size by $\frac{5}{2}n$ in terms of logic gates.
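The recursion behind the Brent-Kung step can be sketched as follows. This is an illustrative Python model, not a circuit description: the name `brent_kung` is hypothetical, list index 0 holds $z_1$, depth and size are counted in prefix gates, and repeaters as well as the reduced gates of the correction step are not modelled.

```python
def brent_kung(z, op):
    """Apply Brent-Kung steps recursively for an associative op(upper, lower).

    Returns (prefixes, prefix gate depth, number of prefix gates), where
    prefixes[i-1] = z_i o ... o z_1.
    """
    n = len(z)
    assert n >= 1 and (n & (n - 1)) == 0, "n is assumed to be a power of two"
    if n == 1:
        return list(z), 0, 0
    # input halving: z_2 o z_1, z_4 o z_3, ..., z_n o z_{n-1}
    pairs = [op(z[2 * i + 1], z[2 * i]) for i in range(n // 2)]
    inner, depth, gates = brent_kung(pairs, op)   # inner[i-1] = Z_{1,2i}
    out = [z[0]]
    for i in range(1, n // 2 + 1):
        out.append(inner[i - 1])                  # even index 2i
        if i < n // 2:
            # output correction: Z_{1,2i+1} = z_{2i+1} o Z_{1,2i}
            out.append(op(z[2 * i], inner[i - 1]))
    depth += 2 if n > 2 else 1                    # no correction level for n = 2
    gates += n // 2 + (n // 2 - 1)
    return out, depth, gates
```

With three logic gates per halving gate and two per reduced correction gate, one step on $n$ inputs costs $3 \cdot \frac{n}{2} + 2\left(\frac{n}{2} - 1\right) < \frac{5}{2}n$ logic gates, in line with the size increase stated above.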

Applying these Brent-Kung steps recursively, Brent and Kung obtain a prefix graph that
has prefix gate depth $2\log_2 n - 1$ and logic gate depth $4\log_2 n - 2$. The prefix gate depth
is not optimal anymore, but the adder has a comparatively small size of ^(5n — log 2 n — 8) 
gates, and its fan-out is bounded by two at all inputs and gates. It is shown in Figure 5b.

Brent-Kung steps were actually known before the paper by Brent and Kung [1982], e.g.
they were already used in [Krapchenko 1967]. But the Brent-Kung adder is based solely
on these steps.

Figure 5: Brent-Kung Step and Prefix Graph


3.2 Krapchenko’s Adder 

Krapchenko’s adder is a non-prefix adder computing all carry bits with asymptotically 
optimal depth and linear size. Its fan-out, on the other hand, is almost linear as well, 
which makes it less useful in practice. Krapchenko’s techniques can be used to derive the 
following reduction, based on Brent-Kung steps. 


Lemma 3.1 ([Krapchenko 1967], see [Wegener 1987, pp. 42-46]). Let $r \le \log_2 n - 1$, then
given a family of adders computing $k$ carry bits with depth $d(k)$, maximum fan-out $f(k)$
and size $s(k)$, there is a family of adders computing $n$ carry bits with depth $d(n/2^r) + 4r$,
maximum fan-out $\max\{2, f(n/2^r)\}$ and size $s(n/2^r) + 5n$.

With size $s(n/2^r) + 5.5n$, we can achieve the same depth and a maximum fan-out of at
most $\max\{2, f(n/2^r)\}$.


Proof. We apply $r$ Brent-Kung steps and construct the remaining adder for $n/2^r$ from the
given adder family. Figure 5a shows the situation for $r = 1$. The simple application of $r$
Brent-Kung steps would achieve the claimed depth and fan-out result, except with at most
$2n$ additional 2-input prefix gates (because we will never add more prefix gates than are
present in the Brent-Kung prefix graph) and thus with $6n$ additional logic gates.

To see that $5n$ logic gates are enough, we show that we can omit the propagate signal
computation for the parity-correcting part of the Brent-Kung step. Such a reduced output
prefix gate is shown in Figure 6. With this construction, note that for $i$ even, we have
computed $(y, x) = z_i \circ \cdots \circ z_1$. For $z_{i+1} = (y_{i+1}, x_{i+1})$, the carry bit arising from position
$i + 1$ is $c_{i+2} = y_{i+1} \lor (x_{i+1} \land y)$, which uses two gates. It follows that a Brent-Kung step
uses only the propagate signals at the inputs. For the next Brent-Kung step, the inputs
are the $n/2$ pairs $z_n \circ z_{n-1}, \ldots, z_2 \circ z_1$, therefore we need three logic gates per prefix gate
for the reduction step.

Figure 6: Reduced output correction prefix gate of a refined Brent-Kung step

Note that in Figure 5b the propagate signal at a gate is used if and only if there is a
vertical line from this gate to another prefix gate (and not to an output or repeater). These
lines exist only in the “upper half” of the adder, i.e. the parts with depth $< \log_2 n$. Since
parity correction occurs exclusively in the lower half with depth $> \log_2 n$, the propagate
signals from parity correction steps are never used.

As in the Brent-Kung prefix graph, $n/2$ repeaters can be used to distribute the fan-out and
reduce the maximum fan-out of the parity-correcting gates to two (see also Figure 5b). □

The fact that the refined Brent-Kung step does not require the inner adder to provide 
the propagate signals, which a prefix graph adder would provide, allows us to use the 
multi-input generate adder with the size and depth bounds stated in Theorem 2.6 and
which omits the last $r$ rows of And gates (hatched gates in Figure 3) in the augmented
Kogge-Stone AND-prefix graph. 
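The role of the inner adder as a black box that returns only carry bits can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the construction of Theorem 2.6: the pair convention $z_i = (y_i, x_i)$ with generate $y_i$ and propagate $x_i$ is assumed, `ripple_carries` is a hypothetical stand-in for the inner adder, and `refined_steps`, `signals` and `expected_carries` are names introduced here. Reduction gates are counted with three logic gates and reduced output correction gates with two, as in the proof of Lemma 3.1.

```python
def signals(a, b, n):
    """Generate (y) and propagate (x) bits of two n-bit numbers, LSB first."""
    ys = [(a >> i) & (b >> i) & 1 for i in range(n)]
    xs = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    return ys, xs


def ripple_carries(ys, xs):
    """Stand-in inner adder: returns carry bits only, no propagate signals."""
    carries, c = [], 0
    for y, x in zip(ys, xs):
        c = y | (x & c)
        carries.append(c)
    return carries


def refined_steps(ys, xs, r, inner, stats):
    """Carry out of every position, using r refined Brent-Kung steps."""
    n = len(ys)
    if r == 0:
        return inner(ys, xs)
    # Reduction: full prefix gates (3 logic gates each) on neighbouring pairs.
    ry, rx = [], []
    for j in range(0, n, 2):
        ry.append(ys[j + 1] | (xs[j + 1] & ys[j]))
        rx.append(xs[j + 1] & xs[j])
        stats["logic_gates"] += 3
    even = refined_steps(ry, rx, r - 1, inner, stats)
    # Output correction: only the carry is computed (2 logic gates each);
    # it needs the propagate signal of the input, not of the inner adder.
    out = [ys[0]]
    for j in range(1, n):
        if j % 2 == 1:
            out.append(even[j // 2])
        else:
            out.append(ys[j] | (xs[j] & even[j // 2 - 1]))
            stats["logic_gates"] += 2
    return out


def expected_carries(a, b, n):
    """Reference: carry out of the lowest i+1 bits, via integer addition."""
    result = []
    for i in range(n):
        mask = (1 << (i + 1)) - 1
        result.append((((a & mask) + (b & mask)) >> (i + 1)) & 1)
    return result
```

In this sketch a step on $m$ inputs adds $3 \cdot m/2 + 2 \cdot (m/2 - 1)$ logic gates, which stays within the $\frac{5}{2}m$ per step used above and within $5n$ over all $r$ steps.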

Lemma 3.1 can be used to achieve different trade-offs. In particular, constructions for
all carry bits of size up to can be turned into linear-size circuits with the same 

asymptotic depth or depth guarantee, since we could choose $r = o(1)\log_2 n$. This works
for prefix graphs and logic circuits; for example with $r = \log_2 \log_2 n$, the Kogge-Stone
prefix graph will have size $3n$, depth $\log_2 n + 2\log_2 \log_2 n$ and fan-out bounded by two in
terms of prefix gates [Han and Carlson 1987].
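The Kogge-Stone example can be checked numerically. The sketch below is an illustration under assumptions that are not taken from this section: a Kogge-Stone prefix graph on $k = 2^t$ inputs is taken to have $k\log_2 k - k + 1$ prefix gates and prefix gate depth $\log_2 k$ (the standard counts), and each Brent-Kung step on $m$ inputs is taken to add $m - 1$ prefix gates and prefix gate depth two. The function names are hypothetical, and fan-out is not modelled.

```python
def kogge_stone_prefix(k):
    """(size, depth) in prefix gates of a Kogge-Stone graph on k = 2^t inputs."""
    t = k.bit_length() - 1
    return k * t - k + 1, t


def with_brent_kung_steps(n, r):
    """(size, depth) in prefix gates after r Brent-Kung steps around Kogge-Stone."""
    size, depth = kogge_stone_prefix(n >> r)
    m = n
    for _ in range(r):
        size += m - 1   # m/2 reduction gates plus m/2 - 1 correction gates
        depth += 2      # one reduction row and one correction row
        m //= 2
    return size, depth


def trade_off(n):
    """Apply r = log2(log2 n) steps; n is assumed to be of the form 2^(2^t)."""
    log_n = n.bit_length() - 1
    r = log_n.bit_length() - 1
    size, depth = with_brent_kung_steps(n, r)
    return r, size, depth
```

For $n = 2^{16}$ this gives $r = 4$, a prefix gate count below $3n$ and a prefix gate depth of 20, which is within $\log_2 n + 2\log_2\log_2 n = 24$.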

While the technique in Lemma 3.1 is essentially a 2-input prefix gate construction, the
main result of [Krapchenko 1967] cannot be constructed using only prefix gates. 


3.3 Adders with Asymptotically Minimum Depth, Linear Size, and Fan-Out Two

By combining Theorem 2.6 and Lemma 3.1 we get an adder of asymptotically minimum
depth, linear size and with fan-out at most two. 

Theorem 3.2. There is an adder for $n$ inputs of size bounded by $13.5n$ with depth

\[
\log_2 n + 8\left\lceil\sqrt{\log_2 n}\right\rceil + 6\left\lceil\log_2\left\lceil\sqrt{\log_2 n}\right\rceil\right\rceil + 2
\]

and maximum fan-out two. If $n > 4096$, the size can be bounded by $9.5n$.


Proof. We apply Lemma 3.1 with $r = \lceil\sqrt{\log_2 n}\rceil + 2\lceil\log_2\lceil\sqrt{\log_2 n}\rceil\rceil$ and use an adder for
$n/2^r$ inputs according to Theorem 2.6 as an inner adder. From the proof of Lemma 3.1
we have seen that the output correction of the Brent-Kung step does not require propagate
signals from the inner adder. So the fan-out is indeed two. Using (10), this results in an
adder of size at most

\[
s\!\left(\frac{n}{2^r}\right) + 5.5n \le 8n + 5.5n = 13.5n,
\]

where $s(n/2^r)$ denotes the size of the inner adder.

If n > 4096 we have nl‘IP > 16 that allows us to apply the alternative bound ([9]) to achieve 
a size bound of 9.5n. 

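The depth is

\[
\log_2\frac{n}{2^r} + 5\left\lceil\sqrt{\log_2\frac{n}{2^r}}\right\rceil + 2 + 4r
= \log_2 n + 5\left\lceil\sqrt{\log_2\frac{n}{2^r}}\right\rceil + 2 + 3r
\le \log_2 n + 8\left\lceil\sqrt{\log_2 n}\right\rceil + 6\left\lceil\log_2\left\lceil\sqrt{\log_2 n}\right\rceil\right\rceil + 2,
\]

where we are using $r \le \lceil\sqrt{\log_2 n}\rceil + 2\lceil\log_2\lceil\sqrt{\log_2 n}\rceil\rceil$ for the inequality. □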
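The depth estimate can be evaluated numerically. The following Python sketch is an illustration under assumptions: it reads the bound of Theorem 3.2 as $\log_2 n + 8\lceil\sqrt{\log_2 n}\rceil + 6\lceil\log_2\lceil\sqrt{\log_2 n}\rceil\rceil + 2$, takes $r = \lceil\sqrt{\log_2 n}\rceil + 2\lceil\log_2\lceil\sqrt{\log_2 n}\rceil\rceil$ as in the proof, and takes the inner adder depth on $k$ inputs to be $\log_2 k + 5\lceil\sqrt{\log_2 k}\rceil + 2$. All function names are hypothetical, and the argument `L` stands for $\log_2 n$.

```python
from math import isqrt


def ceil_sqrt(v):
    """Smallest integer s with s * s >= v, for an integer v >= 0."""
    return 0 if v == 0 else isqrt(v - 1) + 1


def ceil_log2(v):
    """Smallest integer c with 2 ** c >= v, for an integer v >= 1."""
    return (v - 1).bit_length()


def steps(L):
    """Number r of Brent-Kung steps assumed in the proof, with L = log2 n."""
    s = ceil_sqrt(L)
    return s + 2 * ceil_log2(s)


def theorem_depth_bound(L):
    """Depth bound of the theorem as read above, with L = log2 n."""
    s = ceil_sqrt(L)
    return L + 8 * s + 6 * ceil_log2(s) + 2


def proof_depth(L):
    """Depth d(n / 2^r) + 4r from the lemma, with the assumed inner depth."""
    r = steps(L)
    inner = L - r
    return inner + 5 * ceil_sqrt(inner) + 2 + 4 * r


def first_crossover(limit=1000):
    """Smallest L below `limit` where the bound beats 2 * L, else None."""
    for L in range(1, limit):
        if theorem_depth_bound(L) < 2 * L:
            return L
    return None
```

Under this reading, `theorem_depth_bound(11)` evaluates to 57, compared with a depth of $2\log_2 n = 22$ for the Kogge-Stone adder on 2048 inputs, and the bound first drops below $2\log_2 n$ at $\log_2 n = 115$. The asymptotic advantage therefore shows up in this bound only for very large $n$.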

From Theorem 3.2 we can easily conclude our main result stated in Section 1.3.

Theorem 1.1 (Main Theorem). Given two $n$-bit numbers $A, B$, there is a logic circuit
computing the sum $A + B$, using gates with fan-in and fan-out two and that has depth
$\log_2 n + o(\log n)$ and size $O(n)$.


4 Technology Mapping 


In this section we show that our construction from Theorem 3.2 can be transformed into
an adder using only Nand, Nor, and Not gates, which are faster in current CMOS
technologies. This increases the depth by one and the size by a small constant factor.


Theorem 4.1. There is an adder for n inputs using only Nand, Nor, and Not gates. 
Its size is bounded by (18 -|- ^)n, the depth is at most 


\[
\log_2 n + 8\left\lceil\sqrt{\log_2 n}\right\rceil + 6\left\lceil\log_2\left\lceil\sqrt{\log_2 n}\right\rceil\right\rceil + 3,
\]


and the maximum fan-out is two. If n > 4096, the size is bounded by (15 -|- |)n. 

In the next two sections, we show how to transform the two main components of our 
construction, the Brent-Kung steps and the multi-input multi-output generate gate adder, 
into circuits using only the desired gates. 
4.1 Mapping Brent-Kung Steps

Brent-Kung steps can be implemented using Nand/Not prefix gates as shown in Figure 7
in the reduction step. Similarly, the reduced output correction gate in Figure 6 can be
realized by two Nand gates and one Not gate, i.e. by eliminating the two rightmost gates
in Figure 7. The modified prefix gates do not increase the depth of the Brent-Kung step,
and increase the size by a constant factor less than 





Figure 7: A Nand/Not prefix gate used in the reduction step 
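Since the figure is not reproduced here, the following Python sketch shows one possible Nand/Not realization that is consistent with the gate counts stated above; it is an assumption, not a transcription of Figure 7. The pair convention (generate $y$, propagate $x$) is assumed and all function names are hypothetical. The generate output uses two Nand gates and one Not gate, and the propagate output uses one further Nand gate and one Not gate, which are the two gates dropped in the reduced output correction gate.

```python
def nand(a, b):
    return 1 - (a & b)


def inv(a):
    return 1 - a


def prefix_gate(yi, xi, yj, xj):
    """Reference And/Or prefix gate: (y_i, x_i) o (y_j, x_j)."""
    return yi | (xi & yj), xi & xj


def nand_not_prefix_gate(yi, xi, yj, xj):
    """Assumed Nand/Not realization; every output has depth two."""
    # By De Morgan: y_i or (x_i and y_j) = nand(not y_i, nand(x_i, y_j)).
    y = nand(inv(yi), nand(xi, yj))
    # Propagate output: And realized as Nand followed by Not.
    x = inv(nand(xi, xj))
    return y, x


def reduced_correction_gate(yi, xi, yj):
    """Generate output only: two Nand gates and one Not gate."""
    return nand(inv(yi), nand(xi, yj))
```

In this realization both outputs are available after two gate levels, matching the depth of the And/Or prefix gate.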


4.2 Mapping Multi-Input Multi-Output Generate Adders 

We want to transform a multi-input multi-output generate gate adder using DeMorgan’s 
laws. For easier understanding, we first insert repeaters so that the gates can be arranged 
in rows, such that all input signals for gates in odd rows are computed in even rows and vice
versa. This bipartite structure is already given in the augmented Kogge-Stone AND-prefix
graph (see Figure 3).

Let us now consider a multi-input generate gate as shown in Figure 2. By inserting 2^/2
repeater gates in the last row of the AND-suffix graph, we achieve a uniform depth of this
first stage. The red row of And gates and the final 2'’“^-output Or already have a uniform 
depth. The additional repeaters increase the size by less than a factor of |. Except for 
the first row of generate gates, the depth of the generate signals equals the depth of the 
propagate signals when they are merged in the red row of And gates. In the first row of
generate gates, the propagate signals arrive there at depth $2r + 1$, while the generate signals
arrive at depth $r - 1$ (see the proof of Lemma 2.4). Thus, if $r$ is odd, we add one additional
repeater at every generate input signal so that it arrives at an odd depth at the red level of
And gates. Note that we can do this without increasing the overall depth, as we already
assumed that the generate signals are delayed by $r + 1$ in the proof of Lemma 2.4. At most
$n$ repeaters are inserted this way.

Some generate gates of the multi-input generate gate adder are just buffer trees, i.e. blue 
boxes in Figure 01 They have depth r — 1, which is odd if and only if the depth r -|- 1 of the 
corresponding paths of generate signals through multi-input generate gates is odd. Thus,
they preserve the bipartite structure. 

Now we can use the bipartite structure to transform the multi-input multi-output generate
adder into a circuit consisting of Nand, Nor, and Not gates. In our construction we
will maintain the following characteristics. Inputs to an odd row, i.e. outputs of an even 
row, will be the original function values, while inputs to an even row, i.e. outputs of an odd 
row, will be the negated original function values. We achieve this by transforming gates 
as follows: Repeaters are always transformed into Not gates. In odd rows, we translate 
And gates into Nand gates and Or gates into Nor gates. In even rows, we translate And 
gates into Nor gates and Or gates into Nand gates. If the number of rows is odd, we add 
one row of Not gates to correct the otherwise negated outputs of the adder. 
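The transformation rules can be exercised on a small layered circuit. The sketch below is illustrative only: the circuit encoding (a list of rows, each gate a tuple of an operation name and indices into the previous row), the example circuit `EXAMPLE`, and all function names are assumptions made for this illustration, and fan-out is not modelled. Rows are numbered from one, so the circuit inputs feed an odd row with the original function values.

```python
from itertools import product

ORIGINAL_OPS = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "rep": lambda a: a,
}
MAPPED_OPS = {
    "nand": lambda a, b: 1 - (a & b),
    "nor": lambda a, b: 1 - (a | b),
    "not": lambda a: 1 - a,
}


def map_rows(rows):
    """Translate And/Or/repeater rows into Nand/Nor/Not rows."""
    mapped = []
    for number, row in enumerate(rows, start=1):
        if number % 2 == 1:   # odd row: original inputs, negated outputs
            rename = {"and": "nand", "or": "nor", "rep": "not"}
        else:                 # even row: negated inputs, original outputs
            rename = {"and": "nor", "or": "nand", "rep": "not"}
        mapped.append([(rename[op], *ins) for op, *ins in row])
    if len(rows) % 2 == 1:    # correct the otherwise negated outputs
        mapped.append([("not", k) for k in range(len(rows[-1]))])
    return mapped


def evaluate(rows, inputs, ops):
    """Evaluate a layered circuit row by row."""
    values = list(inputs)
    for row in rows:
        values = [ops[op](*(values[k] for k in ins)) for op, *ins in row]
    return values


# Hypothetical three-row example on four inputs.
EXAMPLE = [
    [("and", 0, 1), ("or", 2, 3), ("rep", 0)],
    [("or", 0, 1), ("and", 1, 2)],
    [("and", 0, 1)],
]


def equivalent(rows, width):
    """Check that the mapped circuit computes the same outputs on all inputs."""
    mapped = map_rows(rows)
    return all(
        evaluate(rows, bits, ORIGINAL_OPS) == evaluate(mapped, bits, MAPPED_OPS)
        for bits in product((0, 1), repeat=width)
    )
```

Calling `equivalent(EXAMPLE, 4)` compares both circuits on all 16 input assignments. The example has three rows, so `map_rows` appends a fourth row of Not gates.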

Together with the $n$ repeaters that we insert behind each generate input signal if $r$ is odd,
this makes $2n$ gates that can be accounted for by the size of the augmented Kogge-Stone
AND-prefix graph (see Figure 3), which is at least $3n$ if $r > 1$. Thus, the overall size of the
generate adder increases by at most a factor of |. 

Together with the mapping of the Brent-Kung step in Section 4.1, this proves
Theorem 4.1.


Conclusion 

We introduced the first full adder with an asymptotically optimum depth, linear size and 
a maximum fan-out of two. Asymptotically, this is twice as fast and significantly smaller 
than the Kogge-Stone adder, which is often considered the fastest adder circuit, as well as 
most other prefix graph adders. 

For small $n$, Theorem 3.2 will not immediately improve upon existing adders. When
focusing on speed for small $n$, one would rather omit the size reduction from Section 3.
Without the size reduction, our results in Lemma 2.4 match the depth of the Kogge-Stone
adder for 512 inputs and improve on it for 2048 inputs, where $r = 3$, $k = 4$ yields an adder
with depth 21 for our construction, but the adder of Kogge-Stone will have depth 22.

Today’s microprocessors contain adders for a few hundred bits. However, adders for 2048 
bit numbers exist already today on cryptographic chips. Thus we expect that adders based 
on our ideas will find their way into hardware. 